Abstract

In the context of a linear model with a sparse coefficient vector, exponential weights methods 
have been shown to achieve oracle inequalities for prediction. We show that such methods
also succeed at variable selection and estimation under the necessary identifiability condition 
on the design matrix, instead of much stronger assumptions required by other methods such as 
the Lasso or the Dantzig Selector. The same analysis yields consistency results for Bayesian 
methods and BIC-type variable selection under similar conditions. 

Keywords: Variable selection, model selection, sparse linear model, exponential weights, Gibbs 
sampler, identifiability condition. 

1 Introduction 

Consider the standard linear regression model: 

$$y = X\beta_* + z, \tag{1}$$

where $y \in \mathbb{R}^n$ is the response vector; $X \in \mathbb{R}^{n \times p}$ is the regression (or design) matrix, assumed to
have normalized columns; $\beta_* \in \mathbb{R}^p$ is the coefficient vector; and $z \in \mathbb{R}^n$ is white Gaussian noise,
i.e., $z \sim \mathcal{N}(0, \sigma^2 I_n)$. As the model (1) is in general not identifiable, we let $\beta_*$ denote one of the
coefficient vectors of minimal support size such that $X\beta = \mathbb{E}(y)$. Then $J_*$ and $s_*$ denote the
support and support size of $\beta_*$. We are most interested in the case where the coefficient vector is
sparse, meaning $s_*$ is much smaller than $p$. As usual, we want to perform inference based on the
design matrix $X$ and the response vector $y$. The three main inference problems are:

• Prediction: estimate the mean response vector $X\beta_*$;

• Estimation: estimate the coefficient vector $\beta_*$;

• Support recovery: estimate the support $J_*$.
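To make the setting concrete, the following minimal Python sketch simulates model (1) with a sparse coefficient vector and exposes the three inference targets. The function name `simulate_sparse_model`, the choice of ones on the first coordinates, and the rescaling of every column to Euclidean norm $\sqrt{n}$ are illustrative assumptions, not part of the procedure studied in the paper.

```python
import numpy as np

def simulate_sparse_model(n, p, s_star, sigma, seed=0):
    """Draw (X, y, beta_star) from y = X beta_star + z with an s_star-sparse beta_star."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # Assumed normalization: rescale so that every column has Euclidean norm sqrt(n).
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)
    beta_star = np.zeros(p)
    beta_star[:s_star] = 1.0               # illustrative: ones on the first s_star coordinates
    z = sigma * rng.standard_normal(n)     # white Gaussian noise N(0, sigma^2 I_n)
    y = X @ beta_star + z
    return X, y, beta_star

X, y, beta_star = simulate_sparse_model(n=50, p=100, s_star=3, sigma=0.5)
mean_response = X @ beta_star          # prediction target
support = np.flatnonzero(beta_star)    # support recovery target J_*
```

Estimation targets `beta_star` itself; the three problems differ only in which of these objects the procedure is asked to recover.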

These problems are not always differentiated and often referred to jointly as variable/model 
selection in the statistics literature, and feature selection in the machine learning literature. Being 
central to statistics, these problems have been addressed in a large number of papers. We review the
literature with particular emphasis on papers that advanced the theory of model selection. For penalized
regression, we find (Shao, 1997), who provides necessary conditions and sufficient conditions under which
the AIC/Mallows' $C_p$ criteria and the BIC criterion are consistent. For example, AIC/Mallows'
$C_p$ are consistent when there is a unique $\beta$ such that $\mathbb{E}(y) = X\beta$, and this $\beta$ has a support of
fixed size as $n, p \to \infty$. Also, BIC is consistent when the dimension $p$ is fixed and the model is
identifiable (a condition that appears to be missing in that paper). BIC was recently shown in
(Chen and Chen, 2008) to be consistent when the model is identifiable, $p = O(n^a)$ with $a < 1/2$,
and the true coefficient vector has a support of fixed size as $n, p \to \infty$. They also propose an
extended BIC for when $a$ is larger. Assuming the size of the support of $\beta_*$ is known, Raskutti et al.
(2009) establish prediction and estimation performance bounds for best subset selection, and
obtain information bounds for these problems. Relaxing to the $\ell_1$-norm penalty, the Lasso and the
closely related Dantzig Selector were shown to be consistent when the design matrix satisfies a
restricted isometry property (RIP) or has column vectors with low coherence; see (Bickel et al.,
2009; Bunea, 2008; Bunea et al., 2007; Candes and Plan, 2009; Candes and Tao, 2007; Lounici,
2008; Meinshausen and Yu, 2009; Zhao and Yu, 2006) among others. With a carefully chosen
nonconcave penalty, (Fan and Peng, 2004) shows that consistent variable selection is possible when
p = 0(?7-^/^). This condition on p was weakened in the follow-up paper (Fan and Lv, 2011), though
with an additional restriction on the coherence (Condition (16) there). The strongest results in that
line of work seem to appear in (Zhang, 2010), which suggests a minimax concave penalty that leads
to consistent variable selection under much weaker assumptions. The classical forward stepwise
selection, also known as orthogonal matching pursuit, is shown in (Cai and Wang, 2011) to
enable variable selection under an assumption of low coherence on the design matrix. Screening was
studied in (Fan and Lv, 2008) in the ultrahigh dimensional setting, assuming the design is random.
A combination of screening and penalized regression is explored in (Ji and Jin, 2010; Jin et al.,
2012), with asymptotic optimality when the Gram matrix $X^\top X/n$ is (mildly) sparse.

A distinct line of research is the implementation of $\ell_0$-penalized regression via exponential
weights (Catoni, 2004; Dalalyan and Salmon, 2011; Dalalyan and Tsybakov, 2007; Giraud, 2007;
Juditsky et al., 2008; Lounici, 2007; Yang, 2004). This methodology, which has precedents in
the Bayesian literature on model selection (Chipman et al., 2001), has the potential of striking a
good compromise between statistical accuracy and computational complexity. While computational
tractability has only been demonstrated in simulations, a number of sharp statistical results exist
for the prediction problem. In particular, (Alquier and Lounici, 2011; Rigollet and Tsybakov, 2011)
propose exponential weights procedures that achieve sharp sparsity oracle inequalities with no
assumptions on the design matrix $X$. Note that there exists no result in the literature concerning
the problems of estimation and support recovery with an exponential weights approach. For a 
recent survey of the exponential weights literature, see (Rigollet and Tsybakov, 2012). 

Our contribution is the following. We establish performance bounds for the version of
exponential weights studied in (Alquier and Lounici, 2011) for the three main inference problems of
prediction, estimation and support recovery. The methodology developed in the present paper is new
and yields novel results for the sparse regression literature. The main feature of
this methodology is that it requires close to minimal assumptions on the design
matrix $X$. In particular, for estimation and support recovery, the conditions are slightly stronger
than identifiability. Moreover, when the size of the support is known, the exponential weights method
is consistent under the minimum identifiability condition as long as the nonzero coefficients are
large enough, close in magnitude to what is required by any method, in particular matching the
performance of best subset selection (Raskutti et al., 2009). See also (Candes and Davenport, 2011;
Verzelen, 2012; Zhang, 2007). An important by-product of our analysis is a set of consistency results for
BIC-type methods, i.e., variable selection with $\ell_0$-penalty, under similar conditions, extending the
results of Chen and Chen (2008).

The rest of the paper is organized as follows. In Section 2, we describe in detail the methodology 
and state the main results. We also state similar results for variable selection with $\ell_0$-penalty and
for the Bayesian model selection method of (Chipman et al., 2001). In Section 3, we compare the
results we obtained for exponential weights with those established for other methods, in particular
the Lasso and MC+. In Section 4, we briefly discuss the algorithmic implementation of exponential
weights, and show the results of some simple numerical experiments comparing exponential weights
with other popular variable selection techniques in the literature. In Section 5, we discuss our 
results in the light of recent information bounds for model selection. The proofs of our main results 
are in Section 6. 

2 Main results 

We consider the version of exponential weights studied in (Alquier and Lounici, 2011), shown there 
to enjoy optimal oracle performance for the prediction problem. The procedure puts a sparsity prior 
on the coefficient vector and selects the estimates using the posterior distribution. We obtain a 
new prediction performance bound which is based on balancing the sparsity level and the size of the 
least squares residuals. The result does not assume any conditions on the design matrix. The task
of support recovery, to be feasible, necessitates additional assumptions. We show that under
near-identifiability conditions on the design matrix, the posterior concentrates on the correct subset of
nonzero components with overwhelming probability, provided that these coefficients are sufficiently
large (somewhat larger than the noise level). This immediately implies that the maximum a
posteriori (MAP) is consistent. We then derive estimation performance guarantees in Euclidean
norm and $\ell_\infty$-norm for the maximum a posteriori and posterior mean.

Throughout, we assume the noise variance $\sigma^2$ is known. We also assume that $p > n$ and remark
that similar results hold when $n > p$, with $p$ replaced by $n$ in the bounds.

We use some standard notation. For any $u = (u_1, \dots, u_d)^\top \in \mathbb{R}^d$ with $d \ge 1$ and $q \ge 1$, we
define

$$\|u\|_q = \Big(\sum_{j=1}^d |u_j|^q\Big)^{1/q}, \qquad \|u\|_\infty = \max_{1 \le j \le d} |u_j|.$$

Without loss of generality, we assume from now on that the predictors are normalized in the sense
that

$$\frac{1}{\sqrt{n}} \|X_j\|_2 = 1, \quad \text{for all } 1 \le j \le p. \tag{2}$$

For a subset $J \subset [p] := \{1, \dots, p\}$, let $X_J = [X_j, j \in J] \in \mathbb{R}^{n \times |J|}$, where $X_j$ denotes the $j$th
column vector of $X$. For a subset $J \subset [p]$, let $M_J$ be the linear span of $\{X_j, j \in J\}$ and let $P_J$ be
the orthogonal projection onto $M_J$. Then, $P_J^\perp := I_n - P_J$ is the orthogonal projection onto $M_J^\perp$.
We say that a vector is $s$-sparse if its support is of size $s$.

2.1 Exponential weights

We start with the definition of a sparsity prior on the subsets of $[p]$, which favors subsets with
small support. This leads to a pseudo-posterior, which is used in turn to define various exponential
weights estimators.

• The prior $\pi$. Fix an upper bound $s \ge 1$ on the support size, and a sparsity parameter $\lambda > 0$.
The prior chooses the subset $J \subset [p]$ with probability

$$\pi(J) \propto e^{-\lambda |J|} \, \mathbb{1}\{|J| \le s\}. \tag{3}$$

• The posterior $\Pi$. Given that the noise is assumed i.i.d. Gaussian with variance $\sigma^2$, given a
subset of variables $J \subset [p]$, the coefficient vector that maximizes the likelihood is the least
squares estimate $\hat\beta_J$ with a maximum proportional to $\exp(-\|P_J^\perp(y)\|_2^2/(2\sigma^2))$. In light of
this, we define the following pseudo-posterior, which chooses $J \subset [p]$ with probability

$$\Pi(J) \propto \pi(J) \exp\Big(-\frac{\|P_J^\perp(y)\|_2^2}{2\sigma^2}\Big). \tag{4}$$

The prior $\pi$ enforces sparsity and focuses on subsets of size not exceeding $s$. Without additional
knowledge, we shall take $s = p$. The exponential factor in $\|P_J^\perp(y)\|_2$ in the posterior enforces
fidelity to the observations. Note that $\Pi$ is not a true posterior because no prior is assumed for $\beta_*$;
we elaborate on this point in Section 2.4. The variance term $2\sigma^2$ corresponds to the temperature
$T$ in a standard Gibbs distribution. We will calibrate the procedure via the sparsity exponent $\lambda$ in
(3), though we could have done so via the temperature as well. Remember that we assume that $\sigma^2$
is known. When the variance is unknown, we can replace it with a consistent estimator $\hat\sigma^2$.
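For very small $p$, the pseudo-posterior can be computed exactly by enumerating all subsets, which helps in checking intuition before turning to sampling. The sketch below is only an illustration under two assumptions that should be stated plainly: the prior is taken to be $\pi(J) \propto e^{-\lambda|J|}$ on subsets of size at most $s$, and the helper names `residual_ss` and `pseudo_posterior` as well as the toy data are hypothetical.

```python
import itertools
import numpy as np

def residual_ss(X, y, J):
    """Squared norm of P_J^perp y, the part of y orthogonal to span{X_j : j in J}."""
    if len(J) == 0:
        return float(y @ y)
    XJ = X[:, list(J)]
    coef, *_ = np.linalg.lstsq(XJ, y, rcond=None)   # minimum-norm least squares fit
    r = y - XJ @ coef
    return float(r @ r)

def pseudo_posterior(X, y, sigma2, lam, s):
    """Return {J: Pi(J)} over all subsets with |J| <= s (exhaustive, small p only)."""
    p = X.shape[1]
    subsets = [J for k in range(s + 1) for J in itertools.combinations(range(p), k)]
    # log of prior times fidelity term; the prior exp(-lam * |J|) is an ASSUMED form
    log_w = np.array([-lam * len(J) - residual_ss(X, y, J) / (2 * sigma2) for J in subsets])
    w = np.exp(log_w - log_w.max())                 # subtract the max for numerical stability
    return dict(zip(subsets, w / w.sum()))

# Toy data (hypothetical): two strong coefficients among p = 8 variables.
rng = np.random.default_rng(0)
n, p = 40, 8
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[[1, 4]] = 3.0
y = X @ beta_star + 0.5 * rng.standard_normal(n)
post = pseudo_posterior(X, y, sigma2=0.25, lam=5 * np.log(p), s=3)
J_map = max(post, key=post.get)
```

The enumeration visits on the order of $p^s$ subsets, so it is useful only as a reference implementation for toy problems.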

Based on the pseudo-posterior $\Pi$, it is natural to consider the maximum a posteriori (MAP) support
estimate, defined as

$$\hat J_{\rm map} = \arg\max_{J} \Pi(J). \tag{5}$$

This leads to considering the MAP coefficient estimate. For any $J \subset [p]$, let $\hat\beta_J$ denote the
least squares coefficient vector for the sub-model $(X_J, y)$ with minimum Euclidean norm, so that
$\hat\beta_J$ is unique even when the columns of $X_J$ are linearly dependent. When the columns of $X_J$ are
linearly independent, the standard formula applies:

$$\hat\beta_J = (X_J^\top X_J)^{-1} X_J^\top y. \tag{6}$$

Note that I(s) guarantees that (6) holds when $|J| \le s$. The MAP coefficient estimate is then
defined as $\hat\beta_{\rm map} = \hat\beta_{\hat J_{\rm map}}$.

We found that the MAP is not as stable as the posterior mean

$$\hat\beta_{\rm mean} = \sum_{J} \Pi(J) \, \hat\beta_J.$$

We establish results for both of them.
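Given any table of pseudo-posterior weights, the two estimators are a few lines of linear algebra. The following sketch is illustrative only: the names `ls_coef` and `map_and_mean` are hypothetical, the weights are passed in as a plain dictionary, and the minimum-norm least squares solution is obtained through the pseudo-inverse.

```python
import numpy as np

def ls_coef(X, y, J):
    """Minimum-norm least squares coefficients for the sub-model (X_J, y), embedded in R^p."""
    beta = np.zeros(X.shape[1])
    if len(J) > 0:
        idx = list(J)
        # pinv gives (X_J^T X_J)^{-1} X_J^T y when X_J has full column rank
        beta[idx] = np.linalg.pinv(X[:, idx]) @ y
    return beta

def map_and_mean(X, y, post):
    """post: dict mapping subsets (tuples of column indices) to pseudo-posterior weights."""
    J_map = max(post, key=post.get)
    beta_map = ls_coef(X, y, J_map)
    beta_mean = sum(w * ls_coef(X, y, J) for J, w in post.items())
    return J_map, beta_map, beta_mean

# Tiny hand-made example: identity design, two candidate subsets.
X = np.eye(3)
y = np.array([1.0, 2.0, 3.0])
J_map, beta_map, beta_mean = map_and_mean(X, y, {(0,): 0.25, (0, 1): 0.75})
```

In this example the MAP estimate is the least squares fit on the heavier subset, while the posterior mean averages the two fits coordinate by coordinate.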



2.2 Prediction 

We establish a new sparsity oracle inequality for the prediction problem. We show that, in terms
of prediction performance, the maximum a posteriori and posterior mean come within a log factor
of that of the oracle estimator:

$$\|X\hat\beta_{J_*} - X\beta_*\|_2 = \|P_{J_*} z\|_2 = O_P(\sigma\sqrt{s_*}).$$

Theorem 1. Consider a design matrix $X$ with $p > n$ and normalized column vectors (2). Assume
$\lambda = (62 + 12c) \log p$ for some $c > 0$. Then with probability at least $1 - p^{-c}$,

$$\|X\hat\beta_{\rm map} - X\beta_*\|_2 \le \sigma\sqrt{8 s_* \lambda} \quad \text{and} \quad \|X\hat\beta_{\rm mean} - X\beta_*\|_2 \le \sigma\sqrt{12 s_* \lambda}. \tag{8}$$

Note that here, and anywhere else in the paper, what is true of $\hat\beta_{\rm map}$ is true of $\hat\beta_J$ for any $J$
such that $\Pi(J) \ge \Pi(J_*)$.

In (Alquier and Lounici, 2011), a similar sparsity oracle inequality is established in expectation
using the Stein's Lemma approach of (Leung and Barron, 2006). Here, we use instead the
concentration property of the posterior $\Pi$ and show that the oracle inequality also holds true in
probability. Note that Alquier and Lounici (2011) also established an oracle inequality in
probability for a different exponential weights procedure that requires the knowledge of $\|\beta_*\|_1$. Our result
constitutes an improvement since we do not require such knowledge.

2.3 Concentration of the posterior and support recovery 

Our performance bounds for support recovery rely, as they should, on concentration properties of
the posterior $\Pi$. We first prove that, without any condition on the design matrix $X$, the posterior
$\Pi$ concentrates on subsets of small size.

Proposition 1. Consider a design matrix $X$ with $p > n$ and normalized column vectors (2). For
some $\varepsilon > 0$ and $c > 1$, take

$$\lambda = (23 + 5c) \log p. \tag{9}$$

Then, with probability at least $1 - 2p^{-c}$, $\Pi(J) < \Pi(J_*)$ for all $J \subset [p]$ such that $|J| > (1 + \varepsilon)s_*$, and
in fact

$$\Pi\big(J : |J| > (1 + \varepsilon)s_*\big) \le 4p^{-c} \, \Pi(J_*). \tag{10}$$



2.3.1 Identifiability 

Actual support recovery requires some additional conditions, the bare minimum being that the
model is identifiable.

Condition I(s): For any subset $J \subset \{1, \dots, p\}$ of size $|J| \le s$, the submatrix $X_J$ is full-rank.

This condition characterizes the identifiability of the model as stated in the following simple result.

Lemma 1. Assuming $\beta_* \in \mathbb{R}^p$ is $s_*$-sparse, it is identifiable if, and only if, I($2s_*$) is satisfied.

In this paper, we establish that exponential weights, and also $\ell_0$-penalized variable selection,
allow for support recovery and estimation under the condition I($(2 + \varepsilon)s_*$) for any fixed $\varepsilon > 0$, as
long as the non-zero entries of the coefficient vector are sufficiently large. In fact, I($2s_*$) suffices
when $s_*$ is known.

While I(s) is qualitative, results on estimation and support recovery necessarily require a
quantitative measure of correlation in the covariates. The following quantity appears in the performance
bounds we derive for exponential weights and related methods: for any integer $s \ge 1$, define

$$\nu_s = \min_{J \subset [p] : |J| \le s} \; \min_{u : \|u\|_2 = 1} \frac{1}{\sqrt{n}} \|X_J u\|_2. \tag{11}$$

Equivalently, $\nu_s$ is the smallest singular value of $X_J/\sqrt{n}$ among submatrices $X_J$ of $X$ made of at most $s$
columns. Note that, indeed, I(s) is equivalent to $\nu_s > 0$.
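The quantity $\nu_s$ can be evaluated by brute force on small designs, which also gives a direct numerical check of Condition I(s). The sketch below assumes the $1/\sqrt{n}$ scaling of the submatrices; the function name `nu` and the two toy designs are hypothetical.

```python
import itertools
import numpy as np

def nu(X, s):
    """Brute-force nu_s: smallest singular value of X_J / sqrt(n) over all J with 1 <= |J| <= s."""
    n, p = X.shape
    best = np.inf
    for k in range(1, s + 1):
        for J in itertools.combinations(range(p), k):
            sv = np.linalg.svd(X[:, list(J)] / np.sqrt(n), compute_uv=False)
            # With more columns than rows the submatrix cannot have full column rank.
            smallest = sv[-1] if k <= n else 0.0
            best = min(best, smallest)
    return best

n = 4
X = np.sqrt(n) * np.eye(n)[:, :3]                       # orthogonal columns of norm sqrt(n)
X_dup = np.column_stack([X[:, 0], X[:, 0], X[:, 1]])    # duplicated column: I(2) fails
nu_ok, nu_bad = nu(X, 3), nu(X_dup, 2)
```

A strictly positive value certifies I(s), while a value of zero (up to rounding) exhibits a rank-deficient submatrix. The cost grows combinatorially in $s$, so this is a diagnostic for toy problems rather than a practical tool.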
2.3.2 Support recovery

We now state the main result concerning the support recovery problem. It states that, under
I($(2 + \varepsilon)s_*$), the posterior distribution $\Pi$ concentrates sharply on the support of $\beta_*$ (which we
assumed to be $s_*$-sparse) as long as $\lambda$ and the nonzero coefficients are sufficiently large.

Theorem 2. Consider a design matrix $X$, with $p > n$ and normalized column vectors (2), that
satisfies Condition I($(2 + \varepsilon)s_*$) for some fixed $\varepsilon > 0$. Assume that (9) holds and



min 1/3^,^-1 >p:= ^ ' . (12) 
Then, with probability at least $1 - 2p^{-c}$, $\Pi(J_*) \ge \Pi(J)$ for all $J$, and in fact

$$\Pi(J_*) \ge 1 - 4p^{-c}.$$

Under the conditions of Theorem 2, some straightforward calculations imply that $\hat J_{\rm map} = J_*$
with probability at least $1 - 6p^{-c}$. In particular, as $p \to \infty$, the MAP consistently recovers the
support of the coefficient vector. Note that the same is true for a subset drawn from $\Pi$.

The result applies in the ultra-high dimensional setting where $p$ is exponential in $n$, as long as
the conditions are met. Characterizing design matrices $X$ that satisfy I($(2 + \varepsilon)s_*$) in the ultra-high
dimensional setting is an interesting open question beyond the scope of this paper.

We mention that, if $s_*$ is known and we restrict the prior to subsets $J$ of size exactly $s_*$, then
the same conclusions are valid with $\varepsilon = 0$ and $\nu_{(2+\varepsilon)s_*}$ replaced by $\nu_{2s_*}$ in (12), yielding consistent
support recovery under the minimum identifiability condition I($2s_*$). In Section 3, we show that the
Lasso estimator requires much more restrictive conditions on the design matrix and $\beta_*$ to ensure
it selects the correct variables with high probability.

Finally, we note that the concentration is even stronger. Under the same conditions, if 

$$\lambda = \frac{(1 + \varepsilon)(23 + 5c) + m}{\varepsilon} \log p,$$

then 

J2 |Jrn(j) <4p-'=n(j,). (13) 

We will use this refinement in the proof of Theorem 5. 
2.3.3 Estimation 

Armed with results for the support recovery and prediction problems, we establish corresponding 
bounds for the estimation problem. Our first result is a simple consequence of Theorem 1 and 
Proposition 1. 

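Theorem 3. Consider a design matrix $X$ with $p > n$ and normalized column vectors (2). Assume
$\lambda$ satisfies (9) with $\varepsilon < 1/2$. Then with probability at least $1 - 3p^{-c}$, we have

$$\|\hat\beta_{\rm map} - \beta_*\|_2 \le \frac{\sigma}{\nu_{(2+\varepsilon)s_*}} \sqrt{\frac{8 s_* \lambda}{n}}.$$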
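The passage from the prediction bound to an estimation bound can be spelled out in two lines. The display below is only a sketch of that step, under the assumptions that $\nu_s$ carries the $1/\sqrt{n}$ scaling of (11), that $\hat J_{\rm map}$ denotes the MAP support, and that the events of Theorem 1 and Proposition 1 both hold; it does not replace the proof in Section 6.

```latex
\begin{align*}
\sqrt{n}\,\nu_{(2+\varepsilon)s_*}\,\|\hat\beta_{\rm map}-\beta_*\|_2
  &\le \|X(\hat\beta_{\rm map}-\beta_*)\|_2
  && \text{since } |\hat J_{\rm map}\cup J_*| \le (1+\varepsilon)s_* + s_* \text{ by Proposition 1,} \\
  &\le \sigma\sqrt{8 s_*\lambda}
  && \text{by Theorem 1.}
\end{align*}
```

Dividing both sides by $\sqrt{n}\,\nu_{(2+\varepsilon)s_*}$ gives $\|\hat\beta_{\rm map}-\beta_*\|_2 \le (\sigma/\nu_{(2+\varepsilon)s_*})\sqrt{8 s_*\lambda/n}$, and a union bound over the two events accounts for the probability $1 - 3p^{-c}$.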

We continue with bounds on the estimation error, this time in terms of the $\ell_\infty$-norm. Based on
Theorem 2 (and its proof), we deduce the following.

Theorem 4. Let the conditions of Theorem 2 be satisfied. Then, with probability at least $1 - 7p^{-c}$,
we have



Il3„.p-/3.IU< J^^^^i^. (14) 

We emphasize that this estimator requires only the near minimum condition I($(2 + \varepsilon)s_*$) and
that the nonzero components of $\beta_*$ are somewhat larger than the noise level in (12) to achieve the
optimal (up to logs) dependence on $n, p$ of the $\ell_\infty$-norm estimation bound. We will develop this
point further in our comparison with the Lasso.

We now study the performance of the posterior mean $\hat\beta_{\rm mean}$ and that of the following variant

3= V n(J)3j, vj:= min ^||Xj«||2. (15) 

Define the quantity $\nu_{\min} = \min_{J \subset [p] : \nu_J > 0} \nu_J$ and note that $\nu_{\min} > 0$.
Theorem 5. Let the conditions of Theorem 2 be satisfied and let $c > 1$.



1. Take A = (^+^)(^^+^'^)+^ log p. Then, with probability at least 1 — 4p 



— c 



~ 2(c + l)logp , 3 
\\(3-PJoo < cr. + 

2. If in addition I(s) is satisfied. 



a\l (20 + 4c)i^ + + z.^i„||/3JU 



n Wn 



^ / 2(c+l)logp , 3 

llPmean ~ P*lloo S Cr\ — ^ h 



cry (20 + 4c) \ y= h 1^11/3 j^lloo 



3. If in addition + 5) is satisfied and A > (62 + 4c) logp, 



m R\\ <^ 2(c+l)logp 2VWa 



2^/i; 1 



(16) 



We note that $\hat\beta_{\rm mean}$ requires at least I(s). (Recall that we assume $s$ is known such that $s_* \le s$.)
In practice, when the sparsity is unknown, we make a conservative choice $s \gg 2s_*$, so that I(s) is

substantially more restrictive that I(2s^). Typically, we assume that = O (k^) ^^"^ take s of 
this order of magnitude. We will see below in Section 2.3.4 that for Gaussian design, the condition
I($s_* + s$) is satisfied with probability close to 1. On the other hand, the estimation result for the variant (15) holds
true under the near minimum condition I($(2 + \varepsilon)s_*$). For both estimators, the estimation bounds
depend on the quantities $\nu_{\min}$, $\|X\beta_*\|_2$ and $\|\beta_*\|_\infty$, which can potentially yield a sub-optimal
rate of estimation. Note however the presence of the factor $p^{-c}$ in the bound. In particular, if the
nonzero components of $\beta_*$ are sufficiently large, then the quantities $\nu_{\min}$, $\|X\beta_*\|_2$ and $\|\beta_*\|_\infty$
may be completely cancelled for a sufficiently large $c > 0$. If I($s_* + s$) is satisfied, then we can
derive a bound that no longer depends on $\|X\beta_*\|_2$ and $\|\beta_*\|_\infty$. We will also see below that this
bound yields the optimal rate of $\ell_\infty$-norm estimation (up to logs) for the estimator $\hat\beta_{\rm mean}$ when the
design matrix is Gaussian. Optimality considerations are further discussed in Section 5 based on
recent information bounds obtained elsewhere.
2.3.4 Example: Gaussian design

The quintessential example is that of a random Gaussian design, where the row vectors of $X$,
denoted $x_1, \dots, x_n$, are independent Gaussian vectors in $\mathbb{R}^p$ with zero mean and $p \times p$ covariance
matrix $\Sigma$. If we assume that $\Sigma$ has 1's on the diagonal, the resulting (random) design is just
slightly outside our setting, since the column vectors are not strictly normalized. Our results
apply nevertheless. Therefore, it is of interest to lower-bound $\nu_s$ for such a design.

We start by relating $X$ and $\Sigma$. Consider $J \subset [p]$, and let $\Sigma_J$ denote the principal submatrix
of $\Sigma$ indexed by $J$. By (Vershynin, 2010, Cor. 1.50 and Rem. 1.51), there is a numeric constant
$C > 0$ such that, when $n \ge C|J|/\eta^2$, with probability at least $1 - 2\exp(-\eta^2 n/C)$, we have

$$\Big\| \frac{1}{n} X_J^\top X_J - \Sigma_J \Big\| \le \eta \, \|\Sigma_J\|,$$

where $\|\cdot\|$ denotes the matrix spectral norm. When this is the case, by Weyl's theorem (Stewart and Sun,
1990, Cor. IV.4.9),

$$\lambda_{\min}\Big(\frac{1}{n} X_J^\top X_J\Big) \ge \lambda_{\min}(\Sigma_J) - \eta \, \lambda_{\max}(\Sigma_J),$$

where $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and largest eigenvalues of a symmetric matrix $A$.
Define

$$\gamma_\Sigma(s) = \max_{J : |J| \le s} \frac{\lambda_{\max}(\Sigma_J)}{\lambda_{\min}(\Sigma_J)}, \qquad \lambda_\Sigma(s) = \min_{J : |J| \le s} \lambda_{\min}(\Sigma_J).$$

Assume that 

aCs logp 

for some a > 2. Then, with probability at least 1 — 2^"*^/^, 

2 

For example, in standard compressive sensing where $\Sigma$ is the identity matrix, we have
$\gamma_\Sigma(s) = \lambda_\Sigma(s) = 1$ for all $s$, in which case with high probability $\nu_s \ge 1/2$ when $n \ge 2Cs\log p$. Consequently,
the $\ell_\infty$-norm estimation bounds in (14) and (16) are of the order $b\sigma\sqrt{\log(p)/n}$ for some numerical
constant $b > 0$. Again, the constants are loose in this discussion.



2.4 Bayesian variable selection with an independence prior 

Many Bayesian techniques for model selection have been proposed in the literature; see (Chipman et al.,
2001) for a comprehensive review. That same paper suggests a procedure similar to ours, except
that it is a bona fide Bayesian model and they use the following independence sparsity prior

$$\pi(J) = \omega^{|J|} (1 - \omega)^{p - |J|},$$

where $\omega \in (0, 1)$ controls the sparsity level. Roughly, $\lambda$ for our prior corresponds to
$\log(1/\omega - 1)$ for this prior. It so happens that the same arguments lead to the same results. Also, as
argued in (Chipman et al., 2001, Sec. 3.3), the fully-specified Bayesian model with prior $\beta_J \sim
\mathcal{N}(0, (X_J^\top X_J)^{-1})$ is very closely related to our exponential weights method.
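The correspondence between the two priors follows from a one-line rewriting of the independence prior. The display below is a sketch that assumes our sparsity prior has the form $\pi(J) \propto e^{-\lambda|J|}$ and ignores the cap $|J| \le s$ on the support size.

```latex
\begin{align*}
\pi(J) = \omega^{|J|}(1-\omega)^{p-|J|}
       = (1-\omega)^{p}\Big(\frac{\omega}{1-\omega}\Big)^{|J|}
       \;\propto\; \exp\!\Big(-|J|\,\log\frac{1-\omega}{\omega}\Big),
\qquad\text{so that}\qquad
\lambda \;\leftrightarrow\; \log\frac{1-\omega}{\omega} = \log\Big(\frac{1}{\omega}-1\Big).
\end{align*}
```

In particular the implied $\lambda$ is positive exactly when $\omega < 1/2$, and a choice $\lambda = C \log p$ corresponds to $\omega = 1/(1 + p^{C})$, that is, to a prior inclusion probability that decays polynomially in $p$.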
2.5 $\ell_0$-penalized variable selection

Chen and Chen (2008) not only showed that BIC was consistent when $p < \sqrt{n}$ (under some mild
conditions on the design matrix), they also suggested a modification of the penalty term to yield
a method that is consistent for larger values of $p$ when the number of variables in the true (i.e.,
sparsest) model is bounded independently of $n$ or $p$.

By a simple modification of our arguments, our results for the exponential weights MAP are seen
to apply to

$$\hat J = \arg\min_{J : |J| \le s} \; \|(I - P_J) y\|_2^2 + \lambda |J|.$$

Consequently, our work extends that of Chen and Chen (2008) to the case where $s_*$ increases with $p$.
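The $\ell_0$-penalized criterion can be minimized exactly by exhaustive search when $p$ is small, which makes its link with the MAP explicit: if the prior is taken to be $\pi(J) \propto e^{-\lambda|J|}$ (an assumption), maximizing (4) amounts to minimizing the residual sum of squares plus $2\sigma^2\lambda|J|$. The sketch below is illustrative only; the function name `l0_select` and the toy data are hypothetical.

```python
import itertools
import numpy as np

def l0_select(X, y, lam, s):
    """argmin over |J| <= s of ||(I - P_J) y||_2^2 + lam * |J| by exhaustive search (small p only)."""
    p = X.shape[1]
    best_J, best_val = (), float(y @ y)     # empty model: the residual is y itself
    for k in range(1, s + 1):
        for J in itertools.combinations(range(p), k):
            XJ = X[:, list(J)]
            coef, *_ = np.linalg.lstsq(XJ, y, rcond=None)
            r = y - XJ @ coef
            val = float(r @ r) + lam * k
            if val < best_val:
                best_J, best_val = J, val
    return best_J, best_val

# Toy data (hypothetical): two strong coefficients among p = 8 variables.
rng = np.random.default_rng(0)
n, p, sigma2 = 40, 8, 0.25
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[[1, 4]] = 3.0
y = X @ beta_star + np.sqrt(sigma2) * rng.standard_normal(n)
# Penalty 2 * sigma^2 * lambda with lambda = 5 log p, matching the MAP under the assumed prior.
J_hat, crit = l0_select(X, y, lam=2 * sigma2 * 5 * np.log(p), s=3)
```

The search is combinatorial in $s$, which is precisely why sampling from the pseudo-posterior is of interest in larger problems.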

3 Comparison with the Lasso and MC+ 

In this section, we compare the theoretical performance of our procedure with other well-known
$\ell_\infty$-estimation and support recovery techniques used in high-dimensional variable selection.

3.1 Lasso

The Lasso estimator is the solution of the convex minimization problem

$$\hat\beta^L = \arg\min_{\beta} \Big\{ \frac{1}{n} \|y - X\beta\|_2^2 + 2\lambda_L \|\beta\|_1 \Big\},$$

where $\lambda_L = A\sigma\sqrt{\log(p)/n}$, $A > 0$ and $\|\cdot\|_1$ is the $\ell_1$-norm. The Lasso has received considerable
attention in the literature over the last few years (Bach, 2008; Bunea, 2008; Bunea et al., 2007;
Meinshausen et al., 2006; Meinshausen and Yu, 2009; Zhao and Yu, 2006). It is not our goal to
give here an exhaustive presentation of all existing results. We refer to Chapter 4 in (Lounici,
2009) and the references cited therein for a comprehensive overview of the literature.

Concerning the $\ell_\infty$-norm estimation and support recovery problems, the most popular
assumption is the Irrepresentable Condition (Bach, 2008; Lounici, 2009; Meinshausen and Yu, 2009;
Wainwright, 2006; Zhao and Yu, 2006) denoted from now on by IC($s_*$). See for instance
Assumption 4.2 in (Lounici, 2009). The condition IC($s_*$) is strictly more restrictive than the identifiability
condition I($2s_*$) and does not hold true in general when the columns of the design matrix $X$ are not weakly
correlated. Define $d_* = \|\Psi_*^{-1} \mathrm{sign}(\beta_*)\|_\infty$ where $\Psi_* := \frac{1}{n} X_{J_*}^\top X_{J_*}$. The following result is the key
to our analysis. Let IC($s_*$) hold true and let the nonzero components of $\beta_*$ be sufficiently large:

miujgj^ > Aadi,^ylog{p)/n. Then, with probability at least 1 — 2p ~~^6 — Si,p~~ , the Lasso 
solution is unique and satisfies

$$c\,\sigma d_* \sqrt{\frac{\log p}{n}} \le \|\hat\beta^L - \beta_*\|_\infty \le C\,\sigma d_* \sqrt{\frac{\log p}{n}}, \tag{17}$$

for some numerical constants $C > c > 0$ that can depend only on $A$. See Theorem 4.1 in (Lounici,
2009) for a more precise statement.

We say that an $\ell_\infty$-norm estimation rate is optimal if it is of the form $a\sigma\sqrt{\log(p)/n}$ where $a > 0$
is an absolute constant, as in the case of the Gaussian sequence model ($n = p$ and $X = I_n$, the $n \times n$
identity matrix). In view of the previous display, the Lasso does not attain in most cases the optimal
$\ell_\infty$-norm estimation rate. Indeed, the quantity $d_*$ generally depends on $s_*$ unless the correlations
between the columns of the design matrix $X$ are very weak. Consider for instance the case
where $\min_{j,k} |(\Psi_*^{-1})_{j,k}| \ge \rho$ for some fixed $\rho > 0$. Then, we can easily find an $s_*$-sparse vector
$\beta_*$ such that $d_* \ge \rho s_*$, and the $\ell_\infty$-norm estimation rate of the Lasso is then suboptimal by a factor $s_*$.

Unlike the Lasso, our exponential weights procedure does not suffer from this limitation. Indeed,
our procedure achieves the optimal $\ell_\infty$-norm estimation rate and support recovery provided that
Condition I($(2 + \varepsilon)s_*$) holds true, which can be the case even for design matrices $X$ with strongly
correlated columns.



Gaussian design. Consider the Gaussian design of Section 2.3.4, but assume now that $\Sigma = I_{p \times p}$,
the $p \times p$ identity matrix. Although the design satisfies the restricted isometry with probability
close to 1, there is no guarantee that $X$ also satisfies an irrepresentable condition IC($s_*$). Let us
assume that this is the case for the sake of comparison. Then, we can show with probability close to
1 that satisfies the mutual coherence condition maxj^^ < where the dependence 

on cannot be improved. Thus, we get < ^/s^ and we cannot guarantee the optimality of the 
$\ell_\infty$-norm estimation bound for the Lasso under the irrepresentable condition. Consequently, we
need the condition $\min_{j \in J_*} |\beta_{*,j}| \ge C\sigma\sqrt{s_*}\sqrt{\log(p)/n}$ for some absolute constant $C > 0$ in order to
guarantee exact support recovery for the Lasso. This condition is to be compared to (12) for the
exponential weights estimators. In that case, we have $\nu_s \ge 1/2$ with probability close to 1 when
$s = O(n/\log p)$, so that (12) becomes simply $\min_{j \in J_*} |\beta_{*,j}| \ge C\sigma\sqrt{\log(p)/n}$ for some numerical
constant $C > 0$. This condition is less restrictive than that for the Lasso by a factor $\sqrt{s_*}$. Next,
we note also that for a Gaussian design, the estimation bounds (14) and (16) for the exponential
weights estimators are optimal (up to log) whereas the estimation bound for the Lasso contains the
additional factor $\sqrt{s_*}$.



Recently, in the framework of instrumental regression, Gautier and Tsybakov (2011) established
for an $\ell_1$-norm minimization procedure $\hat\beta^D$ close to the Dantzig selector (see (3.5) there) that with
probability close to 1



Dxi(3 -P.)\U<2a. 



I logp 

4 J."' 



where the sensitivity 



inf 

AeCj,:||Aj|,= 



-DxX ' XDxA 

n 



with Cj^ 



A G 



lA 



111 



< 



Y^||AjJ|i| for some < c < 1, Dx = diag(Xi*,-- - ,Xp*), 

$X_{k*} = \max_{1 \le j \le n} |X_{j,k}|$ for any $1 \le k \le p$ and the $X_{j,k}$ are the components of $X$. An enticing
property of the sensitivity approach is that the quantities $\kappa_{q,J}$ can be computed in reasonable
time for small $J$, yielding a computationally tractable procedure to build confidence intervals for
the estimation of $\beta_*$. The downside is that without any further conditions on $X$, the optimal
dependence of the bound on $s_*$ is not clear. For instance, if we assume in addition that $X$ satisfies a
restricted eigenvalue condition as in (Bickel et al., 2009), then the dependence of the above bound
on $s_*$ can be proved to be optimal for any $1 \le q \le 2$ (see Section 9 in Gautier and Tsybakov
(2011)). However, if we only assume that the condition I($(2 + \varepsilon)s_*$) is satisfied, then we cannot
establish a clear comparison between the exponential weights and $\hat\beta^D$. The exponential weights
estimator achieves the optimal estimation rate while the dependence on $s_*$ of the $\ell_q$-norm estimation
bound is not explicit for $\hat\beta^D$.
3.2 MC+



The MC+ estimator initially proposed by Zhang (2010) is the solution of the following nonconvex 
minimization problem: 




where Xmc^I > ^^'^ the MC+ penalty function T is nonconvex, equal to outside a compact 
neighborhood of and admits a nonzero right derivative at 0. See equations (2.1)-(2.3) in (Zhang, 
2010) for more details. 

The performance of this estimator is established in Theorem 1 of (Zhang, 2010), where the tuning
of the parameter $\lambda_{mc}$ requires the knowledge of $s_*$ (which is d° in that paper) and the optimal
theoretical choice of $\gamma$ is proportional to $\nu_s^{-2}$ (with $s$ being d* in that paper). Let us also emphasize that
the choice $\gamma \propto \nu_s^{-2}$ requires that I(s) is satisfied, where $s$ is in practice a conservative upper bound
on $s_*$. In addition, the quantity $\nu_s$ is delicate to compute in practice. Note that the exponential
weights do not present the same limitations. Indeed, no prior knowledge of $s_*$ is required and we
only need the condition I($(2 + \varepsilon)s_*$) for an arbitrarily small $\varepsilon > 0$ to establish the consistency of
$\hat\beta_{\rm map}$ even if the parameter $s$ is chosen conservatively (for example, $s = \lfloor n/2 \rfloor$ if no other information
is available). In addition, the tuning of the parameters for the exponential weights does not require
computing any restricted eigenvalues. We mention that the assumptions in Theorem 1 of (Zhang,
2010) do not guarantee the identifiability of $\beta_*$.



4 Numerical Experiments 

In this section, we illustrate the performance of the procedure (4) in variable selection and $\ell_\infty$-norm
estimation on a simulated data set. The posterior (4) is simulated via MCMC. In a nutshell, we
construct an ergodic Markov chain $(\beta_t)_{t \ge 0}$ whose invariant probability distribution is the posterior (4).
Then, we get from (Robert and Casella, 2004) that

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t = T_0 + 1}^{T_0 + T} \beta_t = \hat\beta_{\rm mean}, \quad \text{a.s.},$$

where $T_0 > 0$ is an arbitrary number. In practice, we use $T_0 = 3000$ and $T = 7000$.
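A generic way to build such a chain is a Metropolis-Hastings sampler on subsets that toggles one coordinate at a time. The sketch below is not the algorithm of the cited references: it is a minimal illustration that assumes the prior $\pi(J) \propto e^{-\lambda|J|}$ on $|J| \le s$, uses a symmetric single-flip proposal, and averages the least squares fits $\hat\beta_{J_t}$ after a burn-in of $T_0$ steps. The names `log_weight` and `mh_posterior_mean` and the toy data are hypothetical.

```python
import numpy as np

def log_weight(X, y, J, sigma2, lam):
    """Unnormalized log pseudo-posterior of J and the least squares fit on J (assumed prior exp(-lam |J|))."""
    beta = np.zeros(X.shape[1])
    if not J:
        return -float(y @ y) / (2 * sigma2), beta
    idx = list(J)
    coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
    r = y - X[:, idx] @ coef
    beta[idx] = coef
    return -lam * len(J) - float(r @ r) / (2 * sigma2), beta

def mh_posterior_mean(X, y, sigma2, lam, s, T0=3000, T=7000, seed=0):
    """Average of beta_{J_t} over t = T0+1, ..., T0+T along a single-flip Metropolis-Hastings chain."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    J = ()
    lw, beta = log_weight(X, y, J, sigma2, lam)
    total = np.zeros(p)
    for t in range(1, T0 + T + 1):
        j = int(rng.integers(p))
        J_new = tuple(sorted(set(J) ^ {j}))      # symmetric proposal: toggle coordinate j
        if len(J_new) <= s:                      # proposals outside the prior support are rejected
            lw_new, beta_new = log_weight(X, y, J_new, sigma2, lam)
            if np.log(rng.random()) < lw_new - lw:
                J, lw, beta = J_new, lw_new, beta_new
        if t > T0:
            total += beta
    return total / T

# Toy data (hypothetical): two strong coefficients among p = 8 variables, with s = p.
rng = np.random.default_rng(0)
n, p = 40, 8
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[[1, 4]] = 3.0
y = X @ beta_star + 0.5 * rng.standard_normal(n)
beta_mean = mh_posterior_mean(X, y, sigma2=0.25, lam=5 * np.log(p), s=p, T0=500, T=1000)
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of pseudo-posterior weights, so the normalizing constant of (4) is never needed. Recomputing a least squares fit at every step is wasteful; rank-one updates would be the natural refinement.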

We refer to (Alquier and Lounici, 2011; Rigollet and Tsybakov, 2011) for more details on the
computational aspect. Our numerical study follows those carried out in these references except
that we concentrate on the $\ell_\infty$-norm estimation and variable selection performance of the
procedure $\hat\beta_{\rm mean}$. Note that the exponential weights procedures considered in the present paper and in
(Alquier and Lounici, 2011; Rigollet and Tsybakov, 2011) differ only through the tuning of the
parameters. Indeed, (Alquier and Lounici, 2011; Rigollet and Tsybakov, 2011) consider the prediction
problem whereas we concentrate on the $\ell_\infty$-norm estimation and support recovery problems, which
require a different tuning to guarantee the theoretical consistency. Note also that $\hat\beta_{\rm mean}$ is not
sparse since it is obtained as the expectation of the posterior $\Pi$. However, in view of Theorem 5,
a simple thresholding of $\hat\beta_{\rm mean}$ with a threshold of the order of the noise level yields consistent
support recovery. In our simulations, we observe that few components of $\hat\beta_{\rm mean}$ are significantly
far from 0 whereas the remaining ones are extremely small, thus making the choice of the threshold
easy in practice. From now on, we will denote indifferently by AEW the procedure $\hat\beta_{\rm mean}$ and the
thresholded $\hat\beta_{\rm mean}$.
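The thresholding step itself is elementary. The sketch below shows one way to write it; the function names `threshold_support` and `hard_threshold`, the example vector, and the threshold value `tau` are all hypothetical, and in practice `tau` would be chosen of the order of the noise level, as discussed above.

```python
import numpy as np

def threshold_support(beta_mean, tau):
    """Indices whose posterior-mean coefficient exceeds tau in absolute value."""
    return np.flatnonzero(np.abs(beta_mean) > tau)

def hard_threshold(beta_mean, tau):
    """Set to zero every coordinate with |beta_j| <= tau, keep the others unchanged."""
    return np.where(np.abs(beta_mean) > tau, beta_mean, 0.0)

# Hypothetical posterior mean: two clear signals, two near-zero coordinates.
beta_mean = np.array([0.98, 1e-4, -1.03, 3e-5])
tau = 0.1
support_hat = threshold_support(beta_mean, tau)
beta_sparse = hard_threshold(beta_mean, tau)
```

When the small coordinates are several orders of magnitude below the large ones, as reported in the simulations, the selected support is insensitive to the exact value of `tau`.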
Following the numerical experiments of (Candes and Tao, 2007), we consider the model (1)
where $X$ is an $n \times p$ matrix with independent standard Gaussian entries, the target vector is
$\beta_* = (\mathbb{1}\{j \le s_*\})_{1 \le j \le p}$ for some fixed $s_* \ge 1$ and the noise variance satisfies $\sigma^2 = \|X\beta_*\|_2^2/(9n)$. For each
setting of $(n, p, s_*)$, we perform 100 replications of the model and compare our estimator AEW with
other procedures in the literature on sparse estimation:

1. The Lasso estimator; 

2. The MC+ estimator of (Zhang, 2010); 

3. The SCAD estimator of (Fan and Li, 2001). 

All these procedures are readily implemented in R. We use the glmnet package to compute the
Lasso estimator and the ncvreg package to compute the SCAD and MC+ estimators. The AEW
estimator was computed through the MCMC algorithm described in Section 7 in (Rigollet and Tsybakov,
2011).
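The simulation design described above can be sketched in a few lines. This is only an illustration of the data-generating step, written in Python rather than R; the function name `simulate_design` and the seed are assumptions, and none of the competing estimators is implemented here.

```python
import numpy as np

def simulate_design(n, p, s_star, rng):
    """Draw one replication of model (1) under the independent Gaussian design.

    X has i.i.d. standard Gaussian entries, beta_star has its first s_star
    entries equal to 1, and sigma^2 = ||X beta_star||^2 / (9 n).
    """
    X = rng.standard_normal((n, p))
    beta_star = np.zeros(p)
    beta_star[:s_star] = 1.0
    signal = X @ beta_star
    sigma2 = np.sum(signal ** 2) / (9.0 * n)
    y = signal + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, y, beta_star, sigma2

# One replication for the setting (n, p, s_*) = (100, 200, 5).
X, y, beta_star, sigma2 = simulate_design(100, 200, 5, np.random.default_rng(0))
```

Repeating this call 100 times with fresh randomness gives the replications over which the errors are averaged.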

Figure 1 contains the comparative boxplots for the $\ell_\infty$-norm error over the 100 repetitions for
the independent Gaussian design. Table 1 contains the average $\ell_\infty$-norm error and the standard
deviation over the 100 repetitions for the independent Gaussian design. We observe that the AEW
estimator outperforms the Lasso estimator and exhibits performance similar to MC+ and SCAD.




Figure 1: Independent Gaussian design. Boxplots of the estimation performance measure $\|\hat\beta - \beta_*\|_\infty$
over 100 realizations for the AEW, Lasso, MC+ and SCAD estimators. Left: $(n, p, s_*) = (100, 200, 5)$.
Right: $(n, p, s_*) = (200, 1000, 10)$.



(n, p, s_*)        AEW       Lasso     MC+       SCAD
(100, 200, 5)      0.124     0.249     0.137     0.138
                   (0.041)   (0.068)   (0.050)   (0.056)
(200, 1000, 10)    0.151     0.309     0.153     0.149
                   (0.055)   (0.063)   (0.051)   (0.050)



Table 1: Independent Gaussian design. Means and standard deviations (in parentheses) of the
$\ell_\infty$-norm error over 100 realizations for the AEW, Lasso, MC+ and SCAD estimators.

We note in our simulation study that the four procedures always select the active covariates
but also select non-active ones. Table 2 contains the average support recovery false positive rate
over the 100 repetitions for the four procedures considered in this study. We observe that the
Lasso tends to select too many covariates, as was already known. The (thresholded) AEW estimator
outperforms all other procedures in the support recovery problem.





(n, p, s_*)        AEW       Lasso     MC+       SCAD
(100, 200, 5)      1.60      21.55     1.75      3.02
(200, 1000, 10)    1.98      51.88     2.49      5.22



Table 2: Average support recovery false positive rate over 100 realizations for the AEW, Lasso, MC+ 
and SCAD estimators. 
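The two performance measures reported in the tables can be computed as in the sketch below. The helper names are hypothetical, and the false positive measure is implemented as a count of selected non-active covariates; this normalization is an assumption, since the text does not spell out how the rate is defined.

```python
import numpy as np

def sup_norm_error(beta_hat, beta_star):
    """l_infinity-norm estimation error ||beta_hat - beta_star||_inf."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta_star, dtype=float)
    return float(np.max(np.abs(diff)))

def false_positive_count(selected, true_support):
    """Number of selected covariates that are not in the true support."""
    return len(set(selected) - set(true_support))

# Toy example with p = 4 and true support {0, 1}.
beta_star = np.array([1.0, 1.0, 0.0, 0.0])
beta_hat = np.array([0.9, 1.1, 0.3, 0.0])
err = sup_norm_error(beta_hat, beta_star)
fp = false_positive_count(np.flatnonzero(beta_hat), np.flatnonzero(beta_star))
```

Averaging these quantities over the replications gives summary statistics of the kind shown in Tables 1 and 2.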



5 Discussion 

We established some performance bounds for exponential weights when applied to solving the
problems of prediction, estimation and support recovery, and deduced similar results for a slightly
different Bayesian model selection procedure (Chipman et al., 2001) and $\ell_0$-penalized (BIC-type)
variable selection. How sharp are these bounds? We did not optimize the numerical constants
appearing in our results, simply because we believe our bounds are loose and also because there are
no known sharp information bounds for these problems, except in specific cases (Jin et al., 2012).
That said, there are some results available in the literature (Lounici et al., 2011; Raskutti et al.,
2009; Verzelen, 2012) and our bounds come close to these. For example, from (Raskutti et al.,
2009) we learn that, when $I(2s_*)$ holds, there is a universal constant $C > 0$ such that, for any
estimator $\hat\beta$ that knows $s_*$,

\\d-Pj2>Ca. 



I log(p/s^ 



UK. 



with probability at least 1/2, where 



max min ^^II^jmIU; (19) 

JC[p] : \J\<s \\u\\ = l y/n 



and from (Verzelen, 2012), we learn that, for another universal constant C > 0, 



V 



;y2 



Thus we see that our estimation bounds (14) and (16) come quite close to these information bounds.
Of course, there is a trade-off with computational tractability, as computing the exponential weights
estimates (or even approximating them) in polynomial time remains an open problem. That said,
the numerical experiments show that these methods are promising.



6 Proofs 

For the sake of brevity, we let $\|\cdot\| = \|\cdot\|_2$ throughout this section.

6.1 Proof of Theorem 1

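Define $\xi_J = P_J(y) - X\beta_*$. For $J \subset [p]$ with $|J| = s$, we have

$$\frac{\Pi(J)}{\Pi(J_*)} = \frac{\binom{p}{s_*}}{\binom{p}{s}} \exp\left(\lambda(s_* - s) + \frac{1}{2\sigma^2}\left(\|P_J(y)\|^2 - \|P_{J_*}(y)\|^2\right)\right) \tag{20}$$

with

$$\|P_J(y)\|^2 - \|P_{J_*}(y)\|^2 = 2 z^\top(\xi_J - \xi_{J_*}) + \|\xi_{J_*}\|^2 - \|\xi_J\|^2. \tag{21}$$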

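For the inner product on the RHS, note that $\xi_J \in \mathrm{span}(X_{J \cup J_*})$ and $\xi_{J_*} \in \mathrm{span}(X_{J_*})$, so that

$$|2 z^\top(\xi_J - \xi_{J_*})| = |2 (P_{J \cup J_*} z)^\top(\xi_J - \xi_{J_*})| \le 2\|P_{J \cup J_*} z\| \, \|\xi_J - \xi_{J_*}\|, \tag{22}$$
by the Cauchy-Schwarz inequality.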
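For completeness, the decomposition (21) can be checked directly. The derivation below is a sketch that assumes the notation $\xi_J = P_J(y) - X\beta_*$, so that $y - P_J(y) = z - \xi_J$, and uses the Pythagorean identity $\|y\|^2 = \|P_J(y)\|^2 + \|y - P_J(y)\|^2$.

```latex
\begin{align*}
\|y - P_J(y)\|^2 &= \|z - \xi_J\|^2 = \|z\|^2 - 2 z^\top \xi_J + \|\xi_J\|^2, \\
\|P_J(y)\|^2 - \|P_{J_*}(y)\|^2
  &= \|y - P_{J_*}(y)\|^2 - \|y - P_J(y)\|^2 \\
  &= 2 z^\top(\xi_J - \xi_{J_*}) + \|\xi_{J_*}\|^2 - \|\xi_J\|^2 .
\end{align*}
```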

Lemma 2. For any $c > 0$, with probability at least $1 - p^{-c}$,

$$\|P_J z\|^2 \le (20 + 4c)\sigma^2 |J| \log p, \quad \forall J \subset [p]. \tag{23}$$



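Set $C_J = \sqrt{(20 + 4c)(|J| + s_*)\log p}$. Using Lemma 2 in (22), from (21) we have

\begin{align}
\|P_J(y)\|^2 - \|P_{J_*}(y)\|^2 &\le 2\sigma C_J \|\xi_J - \xi_{J_*}\| + \|\xi_{J_*}\|^2 - \|\xi_J\|^2 \notag \\
&\le 2\sigma C_J (\|\xi_J\| + \|\xi_{J_*}\|) + \|\xi_{J_*}\|^2 - \|\xi_J\|^2 \notag \\
&\le 4\sigma^2 C_J^2 + \tfrac{3}{2}\|\xi_{J_*}\|^2 - \tfrac{1}{2}\|\xi_J\|^2 \notag \\
&\le 6\sigma^2 C_J^2 - \tfrac{1}{2}\|\xi_J\|^2, \tag{24}
\end{align}

where we used the inequality $2ab \le 2a^2 + b^2/2$ in the third inequality, and Lemma 2 to bound
$\|\xi_{J_*}\|$ in the last inequality.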

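We tackle the first part. By definition, $\Pi(\hat J_{\mathrm{map}}) \ge \Pi(J_*)$. Take any $J$ such that $\Pi(J) \ge \Pi(J_*)$
and let $s = |J|$. Plugging the bound (24) into (20), and using some crude bounds, we have

\begin{align*}
1 \le \frac{\Pi(J)}{\Pi(J_*)} &\le \exp\left(s_* \log p + \lambda(s_* - s) + 3(s + s_*)(20 + 4c)\log p - \frac{1}{4\sigma^2}\|\xi_J\|^2\right) \\
&\le \exp\left(s_*(\lambda + (61 + 12c)\log p) - \frac{1}{4\sigma^2}\|\xi_J\|^2\right),
\end{align*}

where we used the fact that $\lambda \ge (62 + 12c)\log p$ in the last inequality. This in turn implies

$$\|\xi_J\|^2 \le 4\sigma^2 s_* (\lambda + (61 + 12c)\log p) \le 8\sigma^2 s_* \lambda,$$
and the first part of (8) follows from that.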



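We now turn to the second part. Define $\mathcal{J} = \{J : \|\xi_J\| > \sigma\sqrt{10\lambda s_*}\}$. We have

$$\|X\hat\beta^{\mathrm{mean}} - X\beta_*\| \le \sum_{J} \|\xi_J\| \Pi(J) \le \sigma\sqrt{10\lambda s_*} \sum_{J \notin \mathcal{J}} \Pi(J) + \sum_{J \in \mathcal{J}} \|\xi_J\| \frac{\Pi(J)}{\Pi(J_*)}. \tag{25}$$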



By (20) and (24), we have 

llOllSg < IIOIlfe,p(A(..-») + 3C3- ^114,11^) 

< -pj-exp |^A(s^ - s) + s^logp + 3Cj - ^||€j|| j, 

where we used the fact that xe~^^ < for all x, and (J) < p**. Hence, since A > (62 + 4c) lo 

we have 

J^IIOII^TT ^ E E ^exp(A(s.-s) + s.logj. + 3C3-2A5.) 

s 

< VWa exp {-{s^ + s)(A - (61 + 12c) logp)) 
= \/l0o--2exp(-s*(A-(61 + 12c)logp)) 



The result now follows from 

a^/wXsl + 2\/l0crp~'* < VWa{y^ + 1) < f7\/l2A^, 
since p >2 and Sj, > 1, as well as A > 25. 

6.2 Proof of Proposition 1 

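Remember (20). We reformulate (21) in the following way:

\begin{align}
\|P_J(y)\|^2 - \|P_{J_*}(y)\|^2 &= y^\top(P_J - P_{J_*})y \notag \\
&= -\|P_J^\perp X\beta_*\|^2 - 2\langle P_J^\perp X\beta_*, z\rangle + z^\top(P_J - P_{J_*})z, \tag{27}
\end{align}

where $P_J^\perp := I - P_J$. Let $\mathcal{J}_{s,t} = \{J \subset [p] : |J| = s, |J \cap J_*| = t, J \ne J_*\}$. We first bound the inner product in (27).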

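Lemma 3. For any $c > 0$, with probability at least $1 - p^{-c}$,

$$\frac{\langle P_J^\perp X\beta_*, z\rangle^2}{\|P_J^\perp X\beta_*\|^2} \le (10 + 2c)\sigma^2 (s \vee s_* - t)\log p, \tag{28}$$

for all $J \in \mathcal{J}_{s,t}$ with $t \le s \wedge s_*$.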



We now bound the quadratic term in (27). 
Lemma 4. For any $c > 0$, with probability at least $1 - p^{-c}$,

$$z^\top(P_J - P_{J_*})z \le (20 + 4c)\sigma^2 (s \vee s_* - t)\log p, \tag{29}$$
for all $J \in \mathcal{J}_{s,t}$ with $t \le s \wedge s_*$.




For a subset $J \subset [p]$, set

$$\gamma_J = \|P_J^\perp X\beta_*\|. \tag{30}$$

Assume that both (28) and (29) hold, which is true with probability at least $1 - 2p^{-c}$. Then,
we have that, for all $J \in \mathcal{J}_{s,t}$,

\begin{align}
y^\top(P_J - P_{J_*})y &\le -\gamma_J^2 + 2\gamma_J \sigma\sqrt{(10 + 2c)(s \vee s_* - t)\log p} + (20 + 4c)\sigma^2 (s \vee s_* - t)\log p \notag \\
&\le (40 + 8c)\sigma^2 (s \vee s_* - t)\log p - \tfrac{1}{2}\gamma_J^2 \tag{31} \\
&\le (40 + 8c)\sigma^2 (s \vee s_* - t)\log p. \tag{32}
\end{align}

The first inequality comes from (27), (28) and (29). The inequality $2ab \le a^2 + b^2$, with $a = \gamma_J/\sqrt{2}$
and $b = \sigma\sqrt{(20 + 4c)(s \vee s_* - t)\log p}$, justifies the second inequality.
Combining (20) and (32), we get

E In = E E E ifexp(A(..-,) + ^«-(F,;-F,;.to) 

< E E m exp (A(s. - s) + (20 + 4c)(s - i) logp) , 



s=[(l+e)s^] t=Q 



where we used the fact that \ J's^t\ = in the last inequality. 

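For the fraction of binomial coefficients, we have

$$\frac{\binom{p}{s_*}}{\binom{p}{s}} \binom{s_*}{t}\binom{p - s_*}{s - t} = \binom{s}{t}\binom{p - s}{s_* - t}.$$

We then use the standard bound on the binomial coefficient

\begin{align}
\log\binom{s}{t} + \log\binom{p - s}{s_* - t} &\le (s - t)\log(es/(s - t)) + (s_* - t)\log(e(p - s)/(s_* - t)) \notag \\
&\le 3(s \vee s_* - t)\log p. \tag{33}
\end{align}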
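The simplification of the binomial fraction rests on a double-counting identity: the number of pairs $(J, J')$ with $|J| = s$, $|J'| = s_*$ and $|J \cap J'| = t$ can be counted starting from either set. The sketch below checks this identity numerically; it assumes that the prior ratio contributes $\binom{p}{s_*}/\binom{p}{s}$ and that $|\mathcal{J}_{s,t}| = \binom{s_*}{t}\binom{p - s_*}{s - t}$, and the function names are hypothetical.

```python
from math import comb

def count_from_j_star(p, s, s_star, t):
    """Choose J_* first, then J with |J| = s and |J ∩ J_*| = t."""
    return comb(p, s_star) * comb(s_star, t) * comb(p - s_star, s - t)

def count_from_j(p, s, s_star, t):
    """Choose J first, then J_* with |J_*| = s_star and |J ∩ J_*| = t."""
    return comb(p, s) * comb(s, t) * comb(p - s, s_star - t)

# Exhaustive check on a small instance (values chosen for illustration).
p, s_star = 12, 3
identity_holds = all(
    count_from_j_star(p, s, s_star, t) == count_from_j(p, s, s_star, t)
    for s in range(p + 1)
    for t in range(min(s, s_star) + 1)
)
```

Dividing both counts by $\binom{p}{s}$ gives the displayed identity without any floating point arithmetic.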

Hence, we have so far that 

E ^^EE-p(^.o, (34) 

J:|J|>[(1+£K] ^ s=0 t=0 

where 

$$A_{s,t} := \omega(s - t)\log p + \lambda(s_* - s), \quad \omega := 23 + 4c.$$
Some simple algebra yields 

'5>[(l+e)**] *=0 s>(l+£)s* t=0 

p-{\-ujlogp)es^ p(s*+l)a; logp 

- (1-p— )(l-p-'=)' ^^'^^ 
where we used the fact that > 2, because p >2, and also — (A — logp)es^ + s*cj log p < — clogp, 
because of (9). This shows that 

n(j:|j|>[(i + eK]) < (i_p-r)"('i-p-c) "'-^>) 

s (r^n(j.). 

using the fact that w > c. From this, and the fact that p^^ < 1/2, we conclude the proof. 
6.3 Proof of Theorem 2 

Let $\nu = \nu_{(2+\varepsilon)s_*}$ for short. The proof of this result is identical to that of Proposition 1 up to (31).
We now need a lower bound on $\gamma_J$. For this, we use the following irrepresentability result.

Lemma 5. Let $X = [X_1 \; X_2]$, with smallest singular value $\delta$, and let $P_2$ denote the orthogonal
projection onto the column space of $X_2$. Then for any $\beta_1$,

$$\|(I - P_2)X_1\beta_1\| \ge \delta\|\beta_1\|.$$
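A quick numerical sanity check of Lemma 5 can be written as follows. This is an illustration on one random instance, not a proof; the dimensions and the seed are arbitrary choices, and the projection is built from a reduced QR factorization of $X_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1, p2 = 30, 4, 6
X1 = rng.standard_normal((n, p1))
X2 = rng.standard_normal((n, p2))
X = np.hstack([X1, X2])

# Smallest singular value of the full matrix X = [X1 X2].
delta = np.linalg.svd(X, compute_uv=False).min()

# Orthogonal projection onto the column space of X2.
Q2, _ = np.linalg.qr(X2)
P2 = Q2 @ Q2.T

beta1 = rng.standard_normal(p1)
lhs = np.linalg.norm((np.eye(n) - P2) @ X1 @ beta1)
rhs = delta * np.linalg.norm(beta1)
lemma5_holds = lhs >= rhs - 1e-10
```

The inequality holds for every choice of `beta1`, since the left-hand side equals the minimum over $\beta_2$ of $\|X_1\beta_1 + X_2\beta_2\|$.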

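Note that for any $J \in \mathcal{J}_{s,t}$ with $s - t \le (1+\varepsilon)s_*$, the smallest singular value of $[X_J \; X_{J_* \setminus J}]$ is
bounded from below by $\sqrt{n}\,\nu$; by Lemma 5, this implies that

$$\gamma_J = \|(I - P_J)(X_{J_*}\beta_{*,J_*})\| = \|(I - P_J)(X_{J_* \setminus J}\beta_{*,J_* \setminus J})\| \ge \sqrt{n}\,\nu\,\|\beta_{*,J_* \setminus J}\|.$$

Hence,

$$\gamma_J \ge \rho\nu\sqrt{n(s_* - t)}, \quad \forall J \in \mathcal{J}_{s,t} \text{ such that } 0 \le t \le s_* \wedge s \text{ and } s \le t + (1+\varepsilon)s_*, \tag{37}$$

where we recall that $\rho$ is defined in (12).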

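In view of (31) and (37) we have, with probability at least $1 - 2p^{-c}$, for all $J \in \mathcal{J}_{s,t}$,

\begin{align}
y^\top(P_J - P_{J_*})y &\le (40 + 8c)\sigma^2 (s \vee s_* - t)\log p - \tfrac{1}{2}\gamma_J^2 \notag \\
&\le (40 + 8c)\sigma^2 (s \vee s_* - t)\log p - \tfrac{1}{2}\rho^2\nu^2 n(s_* - t)\,\mathbb{1}\{s \le t + (1+\varepsilon)s_*\}. \tag{38}
\end{align}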

Next, we have 

_J_ _ Y n(Jl^ m. (39) 

n( J.) ^ n( J,) ^ ^ n( J,) ' ^ ^ 

The first sum in the right-hand side was already bounded in Proposition 1. We concentrate on the 
second sum. 

Combining (20) and (38), we get 

E |§ = " E ■' E E f »p (m^. - ') + i^.y^(p., - p.':>y) 



J:|J|<[(l+e)s*] ' ' s=0 t=0 JeJs,t 

[(l+e)s*] sAst (s^\ (p~s 



s=0 t=0 
[(l+e)s*]sAs 



^ E E (!)(^ _'')exp(A(s,-s) + (20 + 4c)(sVs,-t)logp-r/,,t 
where $\eta_{s,t} := \frac{1}{4\sigma^2}\rho^2\nu^2 n(s_* - t)\,\mathbb{1}\{s \le t + [(1+\varepsilon)s_*]\}$.
Next, we use again (33) to get

E WT)^ ^ E-p(^m), (40) 

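where

$$A_{s,t} := \omega(s \vee s_* - t)\log p + \lambda(s_* - s) - \eta_{s,t}, \quad \omega := 23 + 4c.$$

Let $\alpha = \frac{\rho^2\nu^2 n}{4\sigma^2} - \omega\log p$, and note that $\alpha \ge 2\lambda \ge \lambda + c\log p$ by (9) and (12).

When $s \le s_*$, we have $A_{s,t} = -\alpha(s - t) - (\alpha - \lambda)(s_* - s)$, so that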

Si S Si, s 

(s*-s)clogp g-a(s-i) 



5;^exp(^,,)<5]e-(--)^l°^-5] 



s=0 i=0 s=l t=0 

< 7 ^7 (41) 

- (1 - e-°)(l -p-^) ^ ^ 

When $s_* < s \le (1+\varepsilon)s_*$, we have $A_{s,t} = -\alpha(s_* - t) - (\lambda - \omega\log p)(s - s_*)$, with $\lambda \ge \omega\log p + c\log p$,
leading to



[(1+£)S*] St CO 

^ ^exp(yl,,t)< ^ g-(s-s*)clogp^g-a(s.-t) 
s=s*+l t=0 s=s*+l t=0 



< 1 ^ (42) 

- (1 - e-")(l -p-^) ^ ^ 



Combining (36) with (39)-(42), we conclude that 
1 1 



n(JO - (l-e-")(l-p-^) (l-e-°)(l-p-^) {1 - p-^){l - p"^) 
^ 1 + 2p-^ 

using the fact that a > lo > c. From this, we get 

n( J*) > (1 - p-^fil - 2p~^) > (1 - Ip-^f > 1 - Ap-". 
This concludes the proof of Theorem 2. We note that the proof of (13) is virtually identical. 

6.4 Proof of Theorem 3 

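When (9) is satisfied with $\varepsilon \le 1/2$, then $\lambda$ satisfies both the conditions of Proposition 1 and
Theorem 1. Hence, with probability at least $1 - 2p^{-c} - p^{-c} = 1 - 3p^{-c}$, we have both that
$|\hat J_{\mathrm{map}}| \le (1+\varepsilon)s_*$ and (8). Hence, the support of $\hat\beta^{\mathrm{map}} - \beta_*$ is of size at most $(1+\varepsilon)s_* + s_* = (2+\varepsilon)s_*$,
and we have

$$\|\hat\beta^{\mathrm{map}} - \beta_*\| \le \frac{1}{\sqrt{n}\,\nu_{(2+\varepsilon)s_*}} \|X(\hat\beta^{\mathrm{map}} - \beta_*)\|,$$

with

$$\|X(\hat\beta^{\mathrm{map}} - \beta_*)\| = \|X\hat\beta^{\mathrm{map}} - X\beta_*\| \le \sigma\sqrt{8\lambda s_*},$$

and the result follows.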
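6.5 Proof of Theorem 4

For $r > 0$, we have

\begin{align*}
\mathbb{P}\left(\|\hat\beta^{\mathrm{map}} - \beta_*\|_\infty > r\right) &\le \mathbb{P}\left(\|\hat\beta^{\mathrm{map}} - \beta_*\|_\infty > r, \hat J_{\mathrm{map}} = J_*\right) + \mathbb{P}\left(\|\hat\beta^{\mathrm{map}} - \beta_*\|_\infty > r, \hat J_{\mathrm{map}} \ne J_*\right) \\
&\le \mathbb{P}\left(\|\hat\beta_{J_*} - \beta_*\|_\infty > r\right) + \mathbb{P}\left(\hat J_{\mathrm{map}} \ne J_*\right).
\end{align*}

By Theorem 2, $\hat J_{\mathrm{map}} = J_*$ with probability at least $1 - 2p^{-c}$, so that the second term on the RHS
is bounded by $2p^{-c}$.

Next, we know that $\hat\beta_{J_*} \sim \mathcal{N}(\beta_*, \sigma^2\Psi^{-1}/n)$ with $\Psi := \frac{1}{n}X_{J_*}^\top X_{J_*}$, and in particular,
$\hat\beta_{J_*,j} - \beta_{*,j} \sim \mathcal{N}(0, \sigma^2\tau_j^2/n)$, where $\tau_j^2$ is the $j$th diagonal entry of $\Psi^{-1}$. This matrix being positive
semidefinite, its diagonal terms are all bounded from above by its largest eigenvalue, which is the inverse
of the smallest eigenvalue of $\Psi$, which in turn is at least $\nu_{s_*}^2$. Hence, $\mathrm{Var}(\hat\beta_{J_*,j}) \le \sigma^2/(n\nu_{s_*}^2)$
for all $j \in J_*$, so that a standard tail bound on the normal distribution and the union bound give

$$\mathbb{P}\left(\|\hat\beta_{J_*} - \beta_*\|_\infty > r\right) \le s_* \exp\left(-\frac{n\nu_{s_*}^2 r^2}{2\sigma^2}\right). \tag{43}$$
Taking $r = \sigma\sqrt{2(c+1)\log(p)/(n\nu_{s_*}^2)}$ bounds this by $p^{-c}$, and the desired result follows.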
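The last step is a one-line computation. The sketch below assumes that (43) has the Gaussian tail form $s_* \exp(-n\nu_{s_*}^2 r^2/(2\sigma^2))$ suggested by the variance bound, and that $r^2 = 2(c+1)\sigma^2 \log p/(n\nu_{s_*}^2)$; it also uses $s_* \le p$.

```latex
\begin{align*}
s_* \exp\!\left(-\frac{n \nu_{s_*}^2 r^2}{2\sigma^2}\right)
  = s_* \exp\bigl(-(c+1)\log p\bigr)
  = s_*\, p^{-(c+1)}
  \le p \cdot p^{-(c+1)}
  = p^{-c}.
\end{align*}
```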



6.6 Proof of Theorem 5 

We have 

l|3map-^J|oo < ^||3j-/3J|oon(J) 



< Il3j. - /3Jloon(j.) + ||3j - /3J|oon(j) 



< Il3j. -/3Jloo+ ll3j-/3Jloon(j). 

For any c > 0, we have with probability at least 1 — p~'^ ^ for any J C [p] with vj > 0, that 



(44) 



11/3 



Jiloo 



< 



< 



< 



< 



\J\ 



^/\T\ 



Xl3j\\ 

'\\Pj{z)\\ + \\Pj{Xf3,) 
cjV(20 + 4c)|J|logp+ 



where we have used Cauchy-Schwarz's inequality in the first line and (23) in the last line. 

We now assume that > 0, which implies that vj > for any J C [p] with \J\ <s. Combining 
the previous display with (43) and (44) and a union bound argument, we get with probability at 
least 1 — 2p~'^, 



map 



/3. 



'2(c+ l)logp 
a\J 



— v/(20 + 4c)logp+^| 



mil + 11/3. 



n(j). 
Next, we combine the above display with (13) and a union bound argument to get with probability
at least $1 - 4p^{-c}$ that



li;^ II < rr 2(C+I)l0gp , 4p- 



a\l (20 + 4c)i^ + + u^\\f3J^ 

n \ n 



Note that the same reasoning applied to (3 yields the same /oo-norm estimation bound with Ug 
replaced by fmin- 

We now assume that I'si.+s > 0. Then, for any J C [p\ with \J\ <s, we have 

ii3j-/3JIoo<ii3j-/3JI< 



Combining this last inequality with (44), we get 

ll3map-/3J|oo < ||3j.-/3J|oo + ^^ Yl IIOI|n(J) + 

< Il3j. -PJoo + ^^^u{j^ \ J.) + Yl llOlin(^), 

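where we recall that $\xi_J = X\hat\beta_J - X\beta_*$ and $\mathcal{J} = \{J \subset [p] : \|\xi_J\| > \sigma\sqrt{10 s_* \lambda}\}$. In view of
Theorem 2, we have with probability at least $1 - 2p^{-c}$ that

$$\Pi(\mathcal{J}^c \setminus \{J_*\}) \le 1 - \Pi(J_*) \le 4p^{-c};$$

and in view of (26),

$$\sum_{J \in \mathcal{J}} \|\xi_J\| \Pi(J) \le 2\sqrt{10}\,\sigma p^{-s_*}.$$
Combining the three last displays with (43), we get the result.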
6.7 Proofs of auxiliary results 

Lemma 2 is a special case of Lemma 4 where $s_* = 0$, and we prove Lemma 4 below.
6.7.1 Proof of Lemma 3 

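First, note that $u_J := \langle P_J^\perp X\beta_*, z\rangle \sim \mathcal{N}(0, \sigma^2\gamma_J^2)$, where $\gamma_J$ is defined in (30), so that
$v_J := u_J/(\sigma\gamma_J) \sim \mathcal{N}(0, 1)$. By the union bound and a standard tail bound on the normal distribution,
for $a > 0$, we have

$$\mathbb{P}\left(\max_{J \in \mathcal{J}_{s,t}} v_J^2 > a^2\right) \le \binom{s_*}{t}\binom{p - s_*}{s - t}\exp(-a^2/2).$$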



As in (33), we have

\begin{align}
\log\binom{s_*}{t} + \log\binom{p - s_*}{s - t} &\le (s_* - t)\log(e s_*) + (s - t)\log(e p) \notag \\
&\le 3(s \vee s_* - t)\log p. \tag{45}
\end{align}
Hence,

$$\mathbb{P}\left(\max_{J \in \mathcal{J}_{s,t}} v_J^2 > (10 + 2c)(s \vee s_* - t)\log p\right) \le \exp\bigl(-(2 + c)(s \vee s_* - t)\log p\bigr) \le p^{-(2+c)},$$
since $s \vee s_* - t = 0$ would imply $J = J_*$. We then apply the union bound again,

F fmax max ^ > (10 + 2c)a^ logp] < s (s A + l)p"(^+'^) < p"'^, 

V S:* JeJs,t SV Si, —t J 

which is the result we wanted.

6.7.2 Proof of Lemma 4 

Fix $J \in \mathcal{J}_{s,t}$. First, we notice that

$$z^\top(P_J - P_{J_*})z = z^\top(P_J - P_{J \cap J_*})z - z^\top(P_{J_*} - P_{J \cap J_*})z \le z^\top(P_J - P_{J \cap J_*})z,$$

since $P_{J_*} - P_{J \cap J_*}$ is an orthogonal projection, and therefore positive semidefinite. And $Q_J :=
P_J - P_{J \cap J_*}$ is also an orthogonal projection, of rank $s - t$, so that $\|Q_J z\|^2 \sim \sigma^2\chi^2_{s-t}$. Chernoff's
bound applied to the chi-square distribution yields

logP [xm > o) < —^{a/m — 1 — log(a/m)) < — ^, Va > 2m. 

The union bound, (45) and this tail bound yield

$$\mathbb{P}\left(\max_{J \in \mathcal{J}_{s,t}} \|Q_J z\|^2 > (20 + 4c)\sigma^2 (s \vee s_* - t)\log p\right) \le \exp\bigl(-(2 + c)(s \vee s_* - t)\log p\bigr).$$

The rest of the proof is exactly the same as that of Lemma 3.
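The structural fact used at the start of this proof, namely that the difference of two nested orthogonal projections is again an orthogonal projection whose rank is the difference of dimensions, can be checked numerically. The sketch below is an illustration on a random design; the helper `proj` and the index sets are hypothetical choices.

```python
import numpy as np

def proj(X, cols):
    """Orthogonal projection onto the span of the columns of X indexed by cols."""
    n = X.shape[0]
    if len(cols) == 0:
        return np.zeros((n, n))
    Q, _ = np.linalg.qr(X[:, cols])
    return Q @ Q.T

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 8))
J, J_star = [0, 1, 2, 5], [0, 1, 3]        # s = 4, s_* = 3, t = 2
J_cap = sorted(set(J) & set(J_star))       # J ∩ J_* = [0, 1]

# Q_J = P_J - P_{J ∩ J_*}; the span of X_{J ∩ J_*} is contained in that of X_J.
Q_J = proj(X, J) - proj(X, J_cap)
rank = int(round(np.trace(Q_J)))           # trace of a projection equals its rank
```

Here the rank equals $s - t = 2$, which is the number of degrees of freedom of the chi-square variable in the proof.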

6.8 An irrepresentability result 

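We have

$$\|(I - P_2)X_1\beta_1\|^2 = \min_{\beta_2} \|X_1\beta_1 + X_2\beta_2\|^2 \ge \min_{\beta_2} \delta^2\|\beta\|^2 = \delta^2\|\beta_1\|^2,$$

where $\beta := (\beta_1, \beta_2)$, implying $\|\beta\|^2 = \|\beta_1\|^2 + \|\beta_2\|^2$.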